Abstract

The tool FTOS aims to alleviate designers' burden by offering code generation for non-functional aspects,
including fault-tolerance mechanisms. One crucial aspect in this context is to ensure that the user-selected mechanisms for the
system model are sufficient to resist the faults specified in the underlying fault hypothesis. In this paper, we propose formal
verification approaches to support this claim. We first raise the precision of FTOS to pure mathematical constructs and
formulate the deterministic assumption, which is necessary when extending Giotto-like systems (e.g., FTOS) with
fault-tolerance abilities. We show that local properties of a system satisfying the deterministic assumption are preserved
in a modified synchronous system used as the verification model. This enables the use of techniques known from hardware
verification. As for implementation, we develop a prototype tool called FTOS-Verify, deploy it as an Eclipse add-on for
FTOS, and conduct several case studies.



I. Introduction 

Fault-tolerant systems are systems with the ability to withstand transient or permanent faults; these faults may
be caused by design errors, hardware failures, or environmental impacts. Application domains include, among others,
medicine, avionics, and automation. Introducing fault-tolerance abilities into embedded design raises the following
potential issues compared to standard systems.

• Fault-tolerance mechanisms require redundancy, i.e., additional means (hardware, information, etc.) that are not
needed to implement the actual functionality during fault-free operation. From a supplier's view, it is always desirable to
equip the system with "just enough" fault-tolerance abilities: the existing mechanisms should be sufficient to
resist the faults of the underlying fault hypothesis, but extra mechanisms should not be introduced, for
cost reasons.

• In addition, verification is of special interest in comparison to standard systems, since failures might lead to severe
damage or even endanger lives.

• Furthermore, extra time required for validation might postpone the time-to-market schedule. 

To alleviate the designers' burden regarding the above issues, in this paper we concentrate on systematic methodologies for
integrating automatic formal verification (in particular, verifying non-functional properties regarding the validity of
fault-tolerance mechanisms) into the design process of fault-tolerant systems, since it is regarded as a rigorous method to
guarantee correctness. We use FTOS [4] as our target, which is a model-based tool for the development of fault-tolerant
real-time systems. In the setting of FTOS, designers can select predefined (or self-defined) fault-tolerance mechanisms
while designing their system models, and the corresponding code is generated automatically by the tool. Our goal is to
go one step further by giving designers a proof of whether the equipped mechanisms are sufficient to resist the faults
specified in the fault hypothesis.

Our first task is to raise the precision of FTOS to pure mathematical constructs. To this end, we propose
a formal mathematical model called the global-cycle-accurate (GCA) system (Section II), which captures the essence of
such systems. Intuitively, a GCA system can be viewed as an extension of Giotto systems [9], based on the model of
computation (MoC) Logical Execution Time (see Section I-A for a description of the concept) and equipped with redundancies.
Nevertheless, challenges for verifying such systems remain.

• First, it is difficult to construct the verification model due to the inherent behavior of the MoC, because a GCA
system is synchronous at the logical (global-tick) level while asynchronous at the action (micro-tick) level.

• Secondly, intuitive extensions of redundancy break internal determinism originally maintained in Giotto-like 
systems. 

• Lastly, mapping from models to different platforms (synchronous, asynchronous) while preserving the property is 
non-intuitive. 



To overcome these challenges, we propose the concept of the deterministic assumption (Section III), which relates
all deployed platforms through common features. We discuss the impact of the deterministic assumption on GCA systems
(Section IV). Most importantly, under the deterministic assumption, some properties are preserved across all deployed
platforms, and we can construct a simplified model for verification, where property checking can be achieved by
examining a smaller set of reachable states. The result also holds under additional assumptions regarding the
introduction of faults (Section V).

Figure 1. Two Giotto deployments with different scheduling policies.

Figure 2. Parameterized actions defined in the pattern, and an instantiation with n = 3.

With this knowledge, we implement an Eclipse add-on called FTOS-Verify, which enables the use of formal verification
under FTOS, and outline our case study (Sections VI and VII). The template-based approach allows the software and the
templates to be extended easily for the introduction and automatic verification of fault-tolerance mechanisms. In addition,
features of FTOS-Verify enable non-experts in verification to use the tool with ease. We mention related work and
conclude in Sections VIII and IX.

A. The Concept of Logical Execution Time 

The concept of fixed logical execution time (FLET), which is adopted by several design tools, e.g., Giotto,
TDL [19], and HTL [6], is the primary technique used in FTOS and therefore needs to be introduced beforehand. The basic
motivation for FLET is the synchrony assumption of synchronous languages, which assumes infinitely fast
underlying hardware. For a valid implementation, the actions of a logical moment must be executed before the next
logical moment appears; right before a logical moment, the system behaves deterministically. However, when observing the
behavior at the level of individual actions, no assumptions can be made.

When applying this concept, the designer only specifies the start time $t_{start}$ and the finish time $t_{end}$ (mostly periodic).
Contrary to the traditional view that an output should be ready as soon as possible, the FLET compiler ensures that
the output is observable at $t_{end}$ but no earlier. Within this guarantee, when multiple systems execute on one machine,
the compiler can also allow preemptive scheduling. Under the concept of FLET, internal determinism is guaranteed,
meaning that the relative ordering of operations, irrespective of real time, is the factor that determines the results. Fig. 1
illustrates that in Giotto, two deployments with the same relative ordering behave the same.
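
The FLET idea can be illustrated with a minimal Python sketch. It is not part of FTOS or Giotto; the names `LetTask` and `observable_outputs` and the example time values are assumptions made purely for illustration. The sketch shows that two schedules with different physical finish times produce the same observable outputs, because each output is published exactly at $t_{end}$.

```python
from dataclasses import dataclass


@dataclass
class LetTask:
    """A task with a fixed logical execution time window [t_start, t_end]."""
    name: str
    t_start: int
    t_end: int


def observable_outputs(tasks, finish_times):
    """Return sorted (time, task name) pairs at which outputs become visible.

    finish_times maps a task name to the physical time at which its
    computation actually finished under some (hypothetical) scheduling policy.
    """
    events = []
    for task in tasks:
        done = finish_times[task.name]
        # A deployment is valid only if the computation fits in its window.
        if not (task.t_start <= done <= task.t_end):
            raise ValueError(f"{task.name} misses its logical deadline")
        # The output is published at t_end, never earlier.
        events.append((task.t_end, task.name))
    return sorted(events)


tasks = [LetTask("sense", 0, 5), LetTask("control", 0, 10)]
schedule_a = {"sense": 2, "control": 7}  # e.g. a non-preemptive schedule
schedule_b = {"sense": 4, "control": 9}  # e.g. a preemptive schedule
outputs_a = observable_outputs(tasks, schedule_a)
outputs_b = observable_outputs(tasks, schedule_b)
```

Both schedules yield the same list of observable events, which is the sense in which the physical execution time is hidden from an observer.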

II. GCA Systems with Redundancies: Syntax and Semantics

To describe the FTOS execution model formally, in the following we define several terms, and introduce a simple 
mathematical construct called GCA systems. 

Definition 1. Define the pattern of the machine with coefficient $(x, n)$ to be $\mathcal{A}_{x,n} = (V \cup V_{env}, \sigma, Q_x)$.

• $V$ is the set of arrays with length $n$, and $V_{env}$ the set of environment variables.

• $\sigma := \sigma_1; \sigma_2; \ldots; \sigma_k$ is a fixed sequence of actions, where for all $j = 1 \ldots k$,
$\sigma_j := send(a[x]) \mid a[x] \leftarrow e \mid receive(a[x \oplus 1], \ldots, a[x \oplus n])$ is the atomic action; $\oplus$ is the
modulo-plus operator over $n$, $a[x]$ is the $x$-th element of the array $a$, and $e$ is an operation over variables in
$V_{env} \cup \bigcup_{i \in 1 \ldots n,\, a \in V} a[i]$.

• $Q_x$ is the message queue.

Definition 2. Define the redundant system $S_R$ with n-redundancy to be $\bigvee_{i=1 \ldots n} \mathcal{A}_{i,n}$.

We explain the intuitive meaning of the formal syntax with the assistance of Fig. 2. In the definition of a pattern, $n$
represents the number of redundancies deployed, and $x$ is the index of the machine. Fault tolerance is mostly
achieved by voting on the same variable from different machines, and in FTOS this is done in a distributed manner on each machine.
Therefore, for the array $b[1, \ldots, n]$, machine $x$ uses $b[x]$ to store its own value, while the other elements of
$b[1, \ldots, n]$ store the copies sent by the other machines. In Fig. 2, the sequences of actions are instantiated from
the actions specified in the pattern, based on the different machine indices.

Definition 3. The configuration of $S_R = \bigvee_{i=1 \ldots n} \mathcal{A}_{i,n}$ is $((v_1, q_1, \Delta_{next_1}), \ldots, (v_n, q_n, \Delta_{next_n}))$. For each machine
$i = 1 \ldots n$,

• $v_i$ is the set of the current values for the variable set $V$.

• $q_i$ is the current content of the local message queue $Q_i$.

• Let $atomic(\sigma)$ be the set of atomic operations in the action sequence $\sigma$; then $\Delta_{next_i} \in atomic(\sigma) \cup \{null\}$
is the next atomic action taken in $\sigma$.

Definition 4. The change of configuration of $S_R = \bigvee_{i=1 \ldots n} \mathcal{A}_{i,n}$ is caused by the following operations.

1) For machine $i$, let $s$ and $\sigma_j$ be the current configuration for $a[i]$ and $\Delta_{next_i}$. An action $\sigma_j := a[i] \leftarrow e$ updates
$a[i]$ from $s$ to $e$, and changes $\Delta_{next_i}$ to $\sigma_{j+1}$ if $j = 1 \ldots k-1$ and to null otherwise.

2) For machine $i$, let $s$ and $\sigma_j$ be the current configuration for $a[i]$ and $\Delta_{next_i}$. For all $m = 1 \ldots n$, $m \neq i$, let $q_m$
be the configuration of the local queue in machine $m$. An action $\sigma_j := send(a[i])$ updates the queue from $q_m$ to
$q_m \circ (a[i], s)$, and changes $\Delta_{next_i}$ to $\sigma_{j+1}$ if $j = 1 \ldots k-1$ and to null otherwise.

3) For machine $i$, for all $m = 1 \ldots n$, let $s[m]$ be the current configuration for $a[m]$. Then an action
$\sigma_j := receive(a[i \oplus 1], \ldots, a[i \oplus n])$ performs the following updates.

• The following step is done iteratively over $i' = 1 \ldots n$, $i' \neq i$.

• Let the current configuration of the local queue be $q_i = msg_1 \circ msg_2 \ldots \circ (a[i'], a_1) \circ \ldots \circ (a[i'], a_{last}) \ldots \circ msg_k$.
Let $(a[i'], a_1); \ldots; (a[i'], a_{last})$ be the subsequence of messages with variable $a[i']$ in the queue. Then
$\sigma_j$ updates $q_i$ to $msg_1 \circ msg_2 \ldots \circ msg_k$ with these messages removed, and updates variable $a[i']$ to $a_{last}$. Note that if no related
message exists, the variable is not updated.

• It updates $\Delta_{next_i}$ from $\sigma_j$ to $\sigma_{j+1}$ if $j = 1 \ldots k-1$ and to null otherwise.
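
The three operations of Definition 4 can be sketched in Python as follows. This is only an illustrative reading of the definition, not code from FTOS: the class `Machine`, the function `step`, the tuple encoding of actions, and the 0-based machine indices are all assumptions made for this sketch.

```python
from collections import deque


class Machine:
    """One redundant machine: local values, message queue, next-action pointer."""

    def __init__(self, index, actions):
        self.index = index    # machine index i (0-based in this sketch)
        self.values = {}      # (array name, slot) -> current value
        self.queue = deque()  # local message queue q_i
        self.actions = actions  # fixed action sequence sigma
        self.next = 0         # index of the next action; None plays the role of null


def step(machines, i):
    """Execute the next atomic action of machine i, following Definition 4."""
    m = machines[i]
    if m.next is None:
        return
    kind, name, expr = m.actions[m.next]
    if kind == "assign":      # a[i] <- e
        m.values[(name, i)] = expr(m.values)
    elif kind == "send":      # append (a[i], s) to the queue of every other machine
        message = ((name, i), m.values[(name, i)])
        for other in machines:
            if other.index != i:
                other.queue.append(message)
    elif kind == "receive":   # consume all messages about array `name`
        remaining = deque()
        for key, value in m.queue:
            if key[0] == name:
                m.values[key] = value  # a later message overwrites an earlier one
            else:
                remaining.append((key, value))
        m.queue = remaining
    # Advance the next-action pointer, or set it to null after the last action.
    m.next = m.next + 1 if m.next + 1 < len(m.actions) else None


def make_machine(i):
    actions = [
        ("assign", "a", lambda values, i=i: 10 * (i + 1)),
        ("send", "a", None),
        ("receive", "a", None),
    ]
    return Machine(i, actions)


machines = [make_machine(i) for i in range(3)]
for _ in range(2):            # every machine assigns, then sends
    for i in range(3):
        step(machines, i)
for i in range(3):            # afterwards every machine receives
    step(machines, i)
```

In this run all sends happen before any receive, so every machine ends up with the same three copies of `a`; other interleavings of `step` calls may leave some copies missing, which is the effect discussed in Section III.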

Definition 5. Define a GCA (global-cycle-accurate) system with n-redundancies over $S_R$ as $S = (S_R, \Delta_T, C)$.

• $S_R = \bigvee_{i=1 \ldots n} \mathcal{A}_{i,n}$.

• $\Delta_T$ is the global periodic jump with parameter $T$.

• $C$ is the global clock.

Intuitively, the semantics of GCA systems is that, on the global level, the scheduling sequence of each machine $\mathcal{A}_{i,n}$ is
constrained by $T$: starting from time zero, in every $T$ time units the machine must complete the sequence of actions
defined by $\sigma$.

Definition 6. The change of configuration of $S = (S_R, \Delta_T, C)$ is caused by the following operations.

1) Actions defined by $S_R$.

2) For all $x = 1 \ldots n$, $\Delta_T$ reads $\bigcup_{a \in V} a[x]$, updates $V_{env}$, and resets $\Delta_{next_x}$ to $\sigma_1$, respectively.

3) The clock reading of $C$ changes as time advances.

Definition 7. A GCA (global-cycle-accurate) system must satisfy the following constraint: starting from $t = 0$, $\Delta_T$ is always
activated with period $T$. When $\Delta_T$ terminates, the configurations of $\Delta_{next_1}, \Delta_{next_2}, \ldots, \Delta_{next_n}$ always have the
values null, null, ..., null, respectively.

A. Giotto 

The model of computation of Giotto-like systems, e.g., Giotto, TDL, or HTL, can thus be viewed as the case where 
every component of such system is a GCA system without (1) redundancies and (2) send and receive actions, where 
T can be viewed as the logical deadline. 

III. Deterministic Assumption 
A. Deterministic Assumption in GCA Systems 

Nevertheless, when applying the concept of logical execution time to GCA systems where $n$ is not 1, intuitive
extensions of the scheduling policies from Giotto no longer guarantee internal determinism. The reason is that
machines need to communicate with each other to derive consensus results, and different scheduling policies can yield
different results even when the relative ordering is the same. Fig. 3 illustrates this idea. When $M_1$, $M_2$, and $M_3$ are three
machines which implement triple modular redundancy, liveness messages sent by $M_3$ will be received
by $M_1$ and $M_2$. However, for $M_3$, due to OS scheduling decisions its execution trace can change to that of $M_4$. If so,
messages might not be received successfully, implying that the overall system behavior differs (internal nondeterminism).

The violation of internal determinism hinders the portability of the model. For example, the result of fault-tolerance
mechanisms may differ simply due to the scheduling policies on different machines, which is undesired. To solve this problem,
we propose the concept of the deterministic assumption: an additional constraint that every deployment must
satisfy.

Figure 3. Internal non-determinism due to OS scheduling policies.

Definition 8. Given the following definitions in a GCA system $S$:

• In the pattern definition of the machine, let $\sigma_\alpha := send(a[x])$ and $\sigma_\gamma := send(a[x])$ be the predecessor and
successor send operations for $\sigma_\beta := receive(a[x \oplus 1], \ldots, a[x \oplus n])$ in $\sigma$, i.e., $\alpha < \beta < \gamma$.

• On machine $i$, define $T_{Send,\alpha,i}$ to be the clock reading from $C$ at which the action $\sigma_\alpha$ happens. If $\sigma_\alpha$ does not exist,
then we let $T_{Send,\alpha,i} = -\infty$.

• On machine $j$, define $T_{Receive,\beta,j}$ to be the clock reading from $C$ at which the action $\sigma_\beta$ happens.

• On machine $k$, define $T_{Send,\gamma,k}$ to be the clock reading from $C$ at which the action $\sigma_\gamma$ happens. If $\sigma_\gamma := send(a[x])$
does not exist, then we let $T_{Send,\gamma,k} = \infty$.

Then a GCA system with n-redundancy satisfies the deterministic assumption if, for all machines $i, j, k$,
$T_{Send,\alpha,i} < T_{Receive,\beta,j} < T_{Send,\gamma,k}$.

Intuitively, this means that before machine $j$ starts a receive action for the variables $a[j \oplus 1], \ldots, a[j \oplus n]$, all messages
sent to $j$ earlier with contents related to $a[j \oplus 1], \ldots, a[j \oplus n]$ should be ready in the local network queue $Q_j$, but an
unrelated message ($\sigma_\gamma := send(a[x])$) should not arrive earlier.
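
The inequality of Definition 8 can be checked mechanically once the clock readings are known. The following Python sketch is an illustration only; the function name and the list-based encoding (one clock reading per machine) are assumptions. Since the condition must hold for all $i, j, k$, it suffices to compare the latest predecessor send with the earliest receive, and the latest receive with the earliest successor send.

```python
import math


def satisfies_deterministic_assumption(t_send_alpha, t_receive_beta, t_send_gamma):
    """Check T_Send,alpha,i < T_Receive,beta,j < T_Send,gamma,k for all i, j, k.

    Each argument is a list of clock readings, one per machine. Use
    -math.inf when the predecessor send does not exist and math.inf when
    the successor send does not exist, as in Definition 8.
    """
    return (max(t_send_alpha) < min(t_receive_beta)
            and max(t_receive_beta) < min(t_send_gamma))


no_successor = [math.inf] * 3
# All three machines send before any of them receives.
ok = satisfies_deterministic_assumption([1, 2, 3], [4, 5, 4], no_successor)
# One machine is scheduled to receive at 2.5, before the last send at 3.
violated = satisfies_deterministic_assumption([1, 2, 3], [4, 2.5, 4], no_successor)
```

The second call corresponds to the situation of Fig. 3, where a scheduling decision moves a receive ahead of another machine's send.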

IV. Impact of Deterministic Assumption 
In the following, we discuss how the use of deterministic assumption influences verification and implementation. 

A. Deterministic Assumption Simplifies Verification 

For verification, by introducing deterministic assumption we have explicit control over the sequential ordering. 
Regarding some local properties, we can have an untimed synchronous model without false positives. 

Theorem 1. Let $S = (S_R, \Delta_T, C)$ be a GCA system with n-redundancies satisfying the deterministic assumption, and
let $S_{sync} = (S_R, \Delta_T, C)$ be a GCA system in which each machine executes synchronously in actions¹. Then for
verification conditions $\varphi$, $S \models \varphi \Leftrightarrow S_{sync} \models \varphi$ if $\varphi$ satisfies the following constraints.

1) Property $\varphi$ is in PLTL and does not use the operator X.

2) There exists an $i \in 1 \ldots n$ such that every propositional variable $p$ used in property $\varphi$ is a predicate over
$V_i \cup \Delta_{next_i}$; $p$ is of the format $exp = c$, where $exp := exp + exp \mid exp * exp \mid exp / exp \mid exp - exp \mid (v_a) \mid const$;
$v_a \in V_i \cup \Delta_{next_i}$, and $c$ and $const$ are constants (in-machine/local properties without message queue).

Proof: 

(Step 0: Preparation) To prove the theorem, we use the theories of projection and of the stutter equivalence relation (for
complete definitions, see [16]).

• For a GCA system with n-redundancies, define $\Theta_i$ to be the projection over the execution trace which preserves the variables
in $V_i$ and the next action $\Delta_{next_i}$ in machine $i$ (e.g., $(v_{i_1}, \Delta_{next_{i_1}})(v_{i_2}, \Delta_{next_{i_2}}) \ldots$).

• Let $\Sigma$ be the alphabet, and let $\Upsilon : \Sigma^* \to \Sigma^*$ be the stutter removal operator, which replaces every substring of
identical letters by a single letter (e.g., from xxxyyzzz to xyz); then two words $u_1, u_2 \in \Sigma^*$ are called stutter
equivalent if $\Upsilon(u_1) = \Upsilon(u_2)$.
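
For finite words, the stutter removal operator and the induced equivalence can be written in a few lines of Python. This is a sketch for illustration only; the function names are assumptions, and infinite traces as used in the proof are outside its scope.

```python
from itertools import groupby


def remove_stutter(word):
    """Replace every maximal run of identical letters by a single letter."""
    return [letter for letter, _ in groupby(word)]


def stutter_equivalent(u1, u2):
    """Two finite words are stutter equivalent if they agree after stutter removal."""
    return remove_stutter(u1) == remove_stutter(u2)


collapsed = remove_stutter("xxxyyzzz")
```

The same functions work on any sequence of comparable items, for instance on lists of projected configurations rather than characters.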

(Step 1: Stuttering over projections for asynchronous systems) Without loss of generality we assume $\varphi$ describes
the variables $V_i$ and the next action in machine $i$. For our proof, we only consider the situation within one interval $T$, since the
system behaves periodically and the result can be proved inductively over $T$. For simplicity, in our notation $\sigma_0$ is equal to null.
In $\sigma$, define a receive action $\sigma_{k_1} := receive(a[x \oplus 1], \ldots, a[x \oplus n])$ to be valid if

• there exists $\sigma_{k_2} := send(a[x])$ in $\sigma$ where $k_2 < k_1$, and

• $\theta_{send,max} > \theta_{rec,max}$, where $\theta_{send,max} = \max_{\ell = 0 \ldots k_1 - 1} \{\ell \mid \sigma_\ell := send(a[x]) \text{ or null}\}$, and
$\theta_{rec,max} = \max_{\ell = 0 \ldots k_1 - 1} \{\ell \mid \sigma_\ell := receive(a[x \oplus 1], \ldots, a[x \oplus n]) \text{ or null}\}$.

Without loss of generality, we assume the sequence $\sigma$ contains $\alpha$ receive() actions that are valid. We use
$receive()_{j,x,y}$ to represent the $y$-th valid receive() action, where it is the $x$-th action in sequence $\sigma$, executed on
machine $j$. Also we use $\sigma_{j,x}$ to represent the $x$-th action in sequence $\sigma$, executed on machine $j$. With $\alpha$ valid receive()
operations, we separate the execution of $i$ into $\alpha + 1$ groups, namely
$\{\sigma_{i,1}; \ldots; \sigma_{i,x_1 - 1}\}_1 \{receive()_{i,x_1,1}; \ldots; \sigma_{i,x_2 - 1}\}_2 \ldots \{receive()_{i,x_\alpha,\alpha}; \ldots; \sigma_{i,k}\}_{\alpha+1}$,
where $receive()_{i,x_1,1}, \ldots, receive()_{i,x_\alpha,\alpha}$ are valid.

Consider the influence of the execution of machine $j$, whose execution sequence is the same as that of $i$. Note that with the
deterministic assumption, the execution sequence (considered globally) is not arbitrarily interleaved. For $p = 1 \ldots \alpha + 1$,
for the $p$-th group of machine $i$:

1) For the valid $receive(a[i \oplus 1], \ldots, a[i \oplus n])_{i,x_p,p}$ action in $i$, its value is determined; the deterministic assumption
assures that $\sigma_{j,x'} := send(a[j])$, $x' < x_p$, will be executed earlier than $receive(a[i \oplus 1], \ldots, a[i \oplus n])_{i,x_p,p}$,
and $\sigma_{j,x''} := send(a[j])$, $x_p < x''$ (if it exists), will be executed later than $receive(a[i \oplus 1], \ldots, a[i \oplus n])_{i,x_p,p}$.

2) An invalid receive() action in $i$ is equal to a null operation.

3) For other actions in $j$, their executions do not interfere with $receive(a[i \oplus 1], \ldots, a[i \oplus n])_{i,x_p,p}$. Since any other
action $\sigma_{i,x}$ of $i$ does not retrieve values from the queue, and $receive(a[i \oplus 1], \ldots, a[i \oplus n])_{i,x_p,p}$, the first
action in this group, is determined, each such action is determined based on the previous action $\sigma_{i,x-1}$, regardless of actions in
$j$.

Results (1) to (3) imply that the execution of actions on $j$ leads to stuttering states when the operator $\Theta_i$ is
acting on the state space.

¹ I.e., in each period, let $T_{k,i}$ be the clock reading from $C$ at which the $k$-th action in $\sigma$ is activated on machine $i$; then for all $i, j = 1 \ldots n$,
$T_{k,i} = T_{k,j}$.

(Step 2: Stuttering-projective equivalence) Let $c_a$ and $c_b$ be system configurations under the projection $\Theta_i$. Let
$c_a \to_{\{\sigma_k\}} c_b$ represent the configuration change from $c_a$ to $c_b$ caused by action $\sigma_k$ in machine $i$, and let $c_a \to_{\bar{i}} c_b$ represent
the configuration change from $c_a$ to $c_b$ caused by actions operating on deployed machines other than $i$ and by the advancing
of the clock. For $S$, the projection of the execution trace over machine $i$ with operator $\Theta_i$ in $T$ is

$c_{i_1} \to_{\bar{i}} c_{i_1} \to_{\{\sigma_{i,1}\}} c_{i_2} \to_{\bar{i}} c_{i_2} \to_{\{\sigma_{i,2}\}} c_{i_3} \to_{\bar{i}} c_{i_3} \ldots c_{i_{k+1}}$

Consider $S_{sync}$, where all machines $i = 1 \ldots n$ execute synchronously at the action level. The synchronous
execution implies that in every $T$, all machines concurrently execute first $\sigma_1$, then $\sigma_2$, and so on. By applying an
analysis similar to Step 1, the projection of the execution trace of $S_{sync}$ over machine $i$ with operator $\Theta_i$ in $T$ is

$c_{i_1} \to_{\{\sigma_{i,1}\}} c_{i_2} \to_{\{\sigma_{i,2}\}} c_{i_3} \ldots c_{i_{k+1}}$

We thus derive the stuttering equivalence between the projective traces of $S_{sync}$ and those of $S$ under $\Theta_i$.
(Step 3: Proof) We conclude the theorem based on the following statements.

• The specification is an in-machine property, describing the state evolution of machine $i$ without terms related
to message queues.

• The specification without the next operator X is, based on [16], stutter-closed, meaning that $\varphi$ cannot distinguish
two stutter-equivalent sequences.

• For the deployed system $S$, the projective trace of $S$ and the projective trace of $S_{sync}$ are stutter equivalent under
$\Theta_i$. Thus $S \models \varphi \Leftrightarrow S_{sync} \models \varphi$, which completes the proof.

■ 

The result of the theorem can be discussed in four directions. 

• (Modeling) Previously, it was difficult to model the behavior of parallel machines which are asynchronous at the action
(micro-instruction) level but synchronous at the logical level. The result of the theorem significantly eases
the construction of the verification model. Our previous attempt was to model the system using the model checker
SPIN [10], which enables the modeling of asynchronous behavior. However, the channel definition only allows the
synchronization of two machines, making synchronous modeling difficult.

• (Platform independent result) Furthermore, one immediate result from theorem 1 is that deterministic assumption 
enables us to "prove GCA system once, result valid for all". For any deployed systems, once they satisfy the 
deterministic assumption, properties always hold regardless of the actual execution time. 

• (Required time for verification) Regarding the run time of verification, our technique makes the verification of 
such systems practicable. Originally the GCA model is an asynchronous model; with the theorem we can construct 
a property-preserving verification model by synchronizing actions. For this verification model, the size of reachable 
state space traversed in verification is exponentially smaller compared to the original GCA model. 

• (Practicability of local properties) At first glance, the use of local properties seems to restrict the applicability. 
Nevertheless, since each machine can keep values sent from other machines, and fault-tolerance mechanisms mostly 
manipulate over these variables, our observation is that local properties constrained by the theorem are powerful 
enough for practical use (for examining fault-tolerance mechanisms). 
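
To give a rough feeling for the size difference mentioned under "Required time for verification", the following Python sketch counts the action interleavings of $n$ machines that each execute $k$ actions in one period. This count is only a proxy chosen for illustration: it is not the reachable state space, it ignores the pruning already imposed by the deterministic assumption, and the function name is an assumption.

```python
from math import factorial


def interleavings(n, k):
    """Number of ways to interleave n independent sequences of k actions each.

    This is the multinomial coefficient (n*k)! / (k!)^n.
    """
    return factorial(n * k) // factorial(k) ** n


# Three machines with three actions each (e.g. assign, send, receive).
asynchronous_orders = interleavings(3, 3)
# In the synchronous model all machines step together: one order, k steps.
synchronous_steps = 3
```

For three machines with three actions each, the unconstrained asynchronous model admits 1680 orderings per period, while the synchronous model has a single ordering of three joint steps.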

B. Deterministic Assumption Modifies Scheduling Policies 

1) Transition and transmission need time: In practice, actions and network transmission take time. Nevertheless, the
application and theory still apply. Here we give guidance on how the deterministic assumption should be ensured
in practice.

Definition 9. Given the following definitions in an implementation of a GCA system $S$:

• Let an action of the type send() only broadcast the message to the network, without updating the queues of the other
machines.

• In the pattern definition of the machine, let $\sigma_\alpha := send(a[x])$ and $\sigma_\gamma := send(a[x])$ be the predecessor and
successor send operations for $\sigma_\beta := receive(a[x \oplus 1], \ldots, a[x \oplus n])$ in $\sigma$, i.e., $\alpha < \beta < \gamma$.

• On machine $i$, define $T_{Send,End,\alpha,i}$ to be the time for machine $i$ at which the action $\sigma_\alpha$ finishes sending the message
to the network.

• On machine $j$, define $T_{Receive,Start,\beta,j}$ to be the clock reading from $C$ at which the action $\sigma_\beta$ starts.

• On machine $j$, define $T_{Receive,End,\beta,j}$ to be the clock reading from $C$ at which the action $\sigma_\beta$ ends.

• On machine $k$, define $T_{Send,Start,\gamma,k}$ to be the clock reading from $C$ at which the action $\sigma_\gamma$ starts.

• Let $T_{i,j}$ be the estimated worst-case message transmission time between the moment a message is sent from machine
$i$ and the moment it is present in the message queue of machine $j$, and define $T_{net} = \max_{i,j \in 1 \ldots n} T_{i,j}$.

Then an implementation of a GCA system with n-redundancy satisfies the deterministic assumption if, for all machines
$i, j, k$, $T_{Send,End,\alpha,i} + T_{net} < T_{Receive,Start,\beta,j}$ and $T_{Receive,End,\beta,j} < T_{Send,Start,\gamma,k}$.
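
The implementation-level condition of Definition 9 can be checked in the same spirit. The Python sketch below is illustrative only; the function name, the argument layout (one reading per machine), and the example transmission matrix are assumptions.

```python
def satisfies_timed_assumption(send_end_alpha, recv_start_beta,
                               recv_end_beta, send_start_gamma, transmission):
    """Check both inequalities of Definition 9 for all machines i, j, k.

    The first four arguments are lists of clock readings, one per machine.
    transmission[i][j] is the estimated worst-case transmission time from
    machine i to machine j; T_net is the maximum entry of this matrix.
    """
    t_net = max(max(row) for row in transmission)
    sends_arrive_in_time = max(send_end_alpha) + t_net < min(recv_start_beta)
    next_send_late_enough = max(recv_end_beta) < min(send_start_gamma)
    return sends_arrive_in_time and next_send_late_enough


transmission = [[0, 2, 3], [2, 0, 2], [3, 2, 0]]  # hypothetical estimates
ok = satisfies_timed_assumption(
    [1, 1.5, 2], [6, 5.5, 6], [7, 7, 7], [8, 9, 8], transmission)
too_early = satisfies_timed_assumption(
    [1, 1.5, 2], [6, 4.5, 6], [7, 7, 7], [8, 9, 8], transmission)
```

In the second call one receive starts at 4.5, which is earlier than the latest send end (2) plus the network bound (3), so the condition is violated.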

2) Least constraint for scheduling: Previously, in Giotto the scheduling only needed to ensure that all actions
(micro-instructions in Giotto) executed in a major tick do not exceed the task period $T$. With the deterministic
assumption, however, the implemented scheduling must be modified. Nevertheless, we argue that the additional constraint induced
by the deterministic assumption is the least possible, since it only captures the dependencies between actions.

Also, the modification of the scheduling policies can be minor. For example, if the underlying execution platform is a
synchronous system (e.g., Esterel, VHDL), then no modification is needed. If the underlying execution platform is an
asynchronous system with an RTOS, and no duplicate send actions occur in $\sigma$, then one method for assuring
the deterministic assumption is to set a fixed window (of length $T_{net}$) that separates all sends (on the left)
from all receives (on the right), as shown in Fig. 4.

At the same time, whether a particular implementation satisfies the deterministic assumption can be
checked using verification engines for timed systems, for example UPPAAL [12] or HyTech [8]. Note that only
the protocol (irrespective of data) needs to be checked, which greatly relieves the burden on those engines².

V. Effect of Faults 

In this section, we formally define the fault model used in FTOS, and discuss the effect when faults are actuating on 
machines. 

A. Formal Construction of Fault Models 

Definition 10. Define the set of possible faults over a GCA system $S$ to be $\bigcup_{i=1 \ldots m} \xi_i$, where each fault is an 8-tuple
$(act, type, \sigma, i, j, k, k', \psi_k)$.

• $act$ is a Boolean variable.

• $type \in \{WrongResult, FailSilent, MessageLoss, Corruption, Masquerade\}$³ is the type of the fault.

• $\sigma \in atomic(\sigma)$, i.e., the tuple component is one atomic action of the action sequence.

• $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, |\sigma|\}$ are the fault-actuating indices.

• $k, k' \in \{1 \ldots n\}$ are parameters.

• $\psi_k : range(\sigma) \to range(\sigma)$ is the error function, where $range(\sigma)$ is the range of $\sigma$.

The effect of faults is summarized as follows. Without loss of generality assume every $\xi$ mentioned is actuating on
machine $i$.

• Let $\xi = (act, WrongResult, \sigma := a[m] \leftarrow e, i, j, k, k', \psi)$; then

  • At time $t$, if the value of $act$ is true and $\Delta_{i,j} := a[m] \leftarrow e$ is activated, then for machine $i$, $\psi_k \circ \sigma$ updates
  the value of $a[m]$ to $\psi(e)$.

• Let $\xi = (act, FailSilent, \sigma, i, j, k, k', \psi)$; then

  • At time $t$, if the value of $act$ is true and $\Delta_{i,j} := \sigma$ is activated, let $v_i$ be the previous configuration for $V_i$ before
  actuating $\sigma$, and let $q_1, q_2, \ldots, q_n$ be the configurations of the message queues $Q_1, Q_2, \ldots, Q_n$. Then in effect $\psi_k \circ \sigma$ does
  not update any value, i.e., $V_i$ goes from $v_i$ to $v_i$, and every $Q_s$, $s = 1 \ldots n$, goes from $q_s$ to $q_s$.

• Let $\xi = (act, MessageLoss, send(a[i]), i, j, k, k', \psi)$; then

  • At time $t$, if the value of $act$ is true and $\Delta_{i,j} := send(a[i])$ is activated, let $a$ be the previous configuration
  for $a[i]$. Let $q_1, q_2, \ldots, q_n$ be the configurations of the message queues $Q_1, Q_2, \ldots, Q_n$. Then in effect $\psi_k \circ \sigma$ updates
  $q_s$, $s = 1 \ldots n$, $s \neq i$, $s \neq k$, to $q_s \circ (a[i], a)$, but $q_k$ to $q_k$.

• Let $\xi = (act, Corruption, send(a[i]), i, j, k, k', \psi)$; then

  • At time $t$, if the value of $act$ is true and $\Delta_{i,j} := send(a[i])$ is activated, let $a$ be the previous configuration
  for $a[i]$. Let $q_1, q_2, \ldots, q_n$ be the configurations of the message queues $Q_1, Q_2, \ldots, Q_n$. Then in effect $\psi_k \circ \sigma$ updates
  $q_s$, $s = 1 \ldots n$, $s \neq i$, $s \neq k$, to $q_s \circ (a[i], a)$, but $q_k$ to $q_k \circ (a[i], k')$.

• Let $\xi = (act, Masquerade, send(a[i]), i, j, k, k', \psi)$; then

  • At time $t$, if the value of $act$ is true and $\Delta_{i,j} := send(a[i])$ is activated, let $a$ be the previous configuration
  for $a[i]$. Let $q_1, q_2, \ldots, q_n$ be the configurations of the message queues $Q_1, Q_2, \ldots, Q_n$. Then in effect $\psi_k \circ \sigma$ updates
  $q_s$, $s = 1 \ldots n$, $s \neq i$, $s \neq k'$, to $q_s \circ (a[i], a)$, but $q_{k'}$ to $q_{k'} \circ (a[k], a)$.

² The complexity of verifying timed systems is high; we found it unsuitable to check a model that includes data information.

³ The set of types was influenced by the IEC-61508 standard concerning networking faults. WrongResult is used to describe internal computing
faults. The MessageDelay fault is described later; MessageRepetition and IncorrectSequence have no influence in our MoC.
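
The effect of the four message-related cases of Definition 10 on the queues can be sketched in Python. This is an illustrative reading of the definition, not FTOS code: the function `send_with_fault`, the list-of-lists queue encoding, the 0-based machine indices, and the example values are all assumptions.

```python
def send_with_fault(queues, i, name, value, fault=None, k=None, k_prime=None):
    """Return the queues after machine i executes send(name[i]) under a fault.

    queues is a list of lists, one message queue per machine; a message is a
    pair ((array name, slot), value). fault is None or one of "FailSilent",
    "MessageLoss", "Corruption", "Masquerade".
    """
    new_queues = [list(q) for q in queues]
    if fault == "FailSilent":
        return new_queues                    # no queue is updated at all
    for s in range(len(queues)):
        if s == i:
            continue                         # a machine does not send to itself
        message = ((name, i), value)
        if fault == "MessageLoss" and s == k:
            continue                         # q_k stays q_k
        if fault == "Corruption" and s == k:
            message = ((name, i), k_prime)   # content replaced by k'
        if fault == "Masquerade" and s == k_prime:
            message = ((name, k), value)     # message appears to carry a[k]
        new_queues[s].append(message)
    return new_queues


empty = [[], [], []]
correct = ((("a", 0), 7))
no_fault = send_with_fault(empty, 0, "a", 7)
lost = send_with_fault(empty, 0, "a", 7, fault="MessageLoss", k=1)
corrupted = send_with_fault(empty, 0, "a", 7, fault="Corruption", k=2, k_prime=99)
masked = send_with_fault(empty, 0, "a", 7, fault="Masquerade", k=1, k_prime=2)
```

The WrongResult case is not covered here because it affects an assignment rather than a send; it would amount to applying the error function to the assigned value.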

In general, all faults can be viewed as being controlled by a fault automaton. 

Definition 11. A fault model over a GCA system $S$ is an extended timed automaton $A_{fault} = (L, AP, \mathcal{T}, I, E, \Sigma, \bigcup_{i=1 \ldots m} \xi_i)$.

• $L$ is a set of locations.

• $AP = \{act_1, \ldots, act_m\}$ is a finite set of atomic propositions, where $act_i \in \mathbb{B}$ represents the activation status of
a particular fault type.

• $\mathcal{T}$ is a finite set of clocks.

• $I$ is the invariant condition, mapping elements of $L$ to clock constraints.

• $E$ is the set of location switches.

• $\Sigma$ is a mapping $L \to 2^{AP}$ indicating the set of faults activated in each control location.

• $\bigcup_{i=1 \ldots m} \xi_i$ is the set of possible faults.

B. Properties 

During execution, the valuation of $\{act_1, \ldots, act_m\}$ can be viewed as a timed run $(\bar{s}, \bar{\nu}) = (s_1, t_1)(s_2, t_2) \ldots$ defined
by the fault model automaton, where $s_1, s_2, \ldots \in \mathbb{B}^m$ are valuations of the atomic propositions, and $t_1, t_2, \ldots \in \mathbb{R}$ are
durations. For the complete definition of a timed run, see [1]. Let $\xi_\alpha = (act_\alpha, type, \sigma_\alpha, i, j, k, k', \psi)$. At time $t$, if $act_\alpha$
is evaluated to true by $A_{fault}$ and $\sigma_\alpha$ is executing, then the fault happens.

In order to give the result where the theorem still holds with fault models, we introduce a series of definitions used 
as assumptions. 

Definition 12. Let $A_{fault} = (L, AP, \mathcal{T}, I, E, \Sigma, \bigcup_{i=1 \ldots m} \xi_i)$ be the fault automaton actuating over the GCA system $S$.
First define an infinite sequence $(S_t, T_t) = (S_1, t_1)(S_2, t_2) \ldots$ as the transition triggering sequence of $S$, where

• For an integer $i \geq 1$, $S_i$ is the set of actions actuating at time $t_i$.

• The sequence $t_1, t_2, t_3, \ldots$ is a strictly ascending time sequence.

Given a timed run $(\bar{s}, \bar{\nu})$ over $\{act_1, \ldots, act_m\}$ induced by $A_{fault}$, and a transition triggering sequence $(S_t, T_t)$ of
$S$, define the actuating fault sequence $\zeta_{(\bar{s}, \bar{\nu})}$ to be an infinite sequence $(S'_1, t'_1)(S'_2, t'_2)(S'_3, t'_3) \ldots$ where

• For all $i \geq 1$, $S'_i$ is the set of actions influenced by $(\bar{s}, \bar{\nu})$.

• $t'_1, t'_2, t'_3, \ldots$ is a strictly ascending time sequence.

Definition 13. Let $(S_{t_1}, T_{t_1}) = (S_{11}, t_{11})(S_{21}, t_{21}) \ldots$ and $(S_{t_2}, T_{t_2}) = (S_{12}, t_{12})(S_{22}, t_{22}) \ldots$ be two transition triggering
sequences over a GCA system $S$. Define two actuating fault sequences $\zeta_{(\bar{s}_1, \bar{\nu}_1)}$ over $(S_{t_1}, T_{t_1})$ and $\zeta_{(\bar{s}_2, \bar{\nu}_2)}$ over
$(S_{t_2}, T_{t_2})$ to be effect-indistinguishable over $S$ if the following holds.

• Starting from time 0, for each interval of length $T$ (e.g., $[0, T), [T, 2T), \ldots$), let $(S_{1\alpha_1}, t_{1\alpha_1})(S_{1\alpha_2}, t_{1\alpha_2}) \ldots (S_{1\alpha_k}, t_{1\alpha_k})$
and $(S_{2\beta_1}, t_{2\beta_1})(S_{2\beta_2}, t_{2\beta_2}) \ldots (S_{2\beta_j}, t_{2\beta_j})$ be the subsequences of $\zeta_{(\bar{s}_1, \bar{\nu}_1)}$ and $\zeta_{(\bar{s}_2, \bar{\nu}_2)}$ contained in this
interval. Then $k = j$, and $\forall m = 1 \ldots k$, $S_{1\alpha_m} = S_{2\beta_m}$, i.e., the two untimed actuating fault subsequences in the interval
are identical.
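
For finite prefixes, the condition of Definition 13 can be checked by grouping each timed sequence by period and comparing the untimed results. The Python sketch below is only an illustration under stated assumptions: the function names, the finite `horizon` (number of periods inspected), and the example sequences are hypothetical, and the input sequences are assumed to be in ascending time order.

```python
def untimed_per_period(sequence, period, horizon):
    """Group a timed sequence [(action_set, time), ...] by period, dropping times.

    Only the first `horizon` intervals [0, T), [T, 2T), ... are considered.
    """
    buckets = [[] for _ in range(horizon)]
    for actions, time in sequence:
        index = int(time // period)
        if index < horizon:
            buckets[index].append(frozenset(actions))
    return buckets


def effect_indistinguishable(seq1, seq2, period, horizon):
    """Compare the untimed subsequences of two sequences interval by interval."""
    return (untimed_per_period(seq1, period, horizon)
            == untimed_per_period(seq2, period, horizon))


seq1 = [({"send"}, 1.0), ({"receive"}, 3.5), ({"send"}, 11.0)]
seq2 = [({"send"}, 2.2), ({"receive"}, 9.9), ({"send"}, 10.1)]
seq3 = [({"send"}, 2.2), ({"receive"}, 10.5), ({"send"}, 10.8)]
same = effect_indistinguishable(seq1, seq2, period=10, horizon=2)
different = effect_indistinguishable(seq1, seq3, period=10, horizon=2)
```

Here `seq1` and `seq2` differ in their time stamps but affect the same actions in the same order within each period, whereas in `seq3` one affected action has moved into the next period.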

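Definition 14. Let $\mathcal{F}_1, \mathcal{F}_2$ be fault models actuating on a GCA system $S$. Define $\mathcal{F}_1$ and $\mathcal{F}_2$ to be effect-indistinguishable if
for every timed run $(\bar{s}_1, \bar{\nu}_1)$ of $\mathcal{F}_1$, with corresponding actuating fault sequence $\zeta_{(\bar{s}_1, \bar{\nu}_1)}$, there exists a timed run $(\bar{s}_2, \bar{\nu}_2)$
of $\mathcal{F}_2$ with corresponding actuating fault sequence $\zeta_{(\bar{s}_2, \bar{\nu}_2)}$ such that $\zeta_{(\bar{s}_1, \bar{\nu}_1)}$ and $\zeta_{(\bar{s}_2, \bar{\nu}_2)}$ are effect-indistinguishable
over $S$, and vice versa.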

Theorem 2. Let $S$ be a GCA system with n-redundancies satisfying the deterministic assumption, and let $S_{sync}$ be a GCA
(global-cycle-accurate) system in which each machine executes synchronously in actions. Let $\mathcal{F}$ and $\mathcal{F}_{sync}$ be the
fault models actuating on $S$ and $S_{sync}$, respectively. If $\mathcal{F}$ and $\mathcal{F}_{sync}$ are effect-indistinguishable, then for verification
conditions $\varphi$, $\mathcal{F} \times S \models \varphi \Leftrightarrow \mathcal{F}_{sync} \times S_{sync} \models \varphi$
if $\varphi$ satisfies the following constraints.

1) Property $\varphi$ is in PLTL and does not use the operator X.

2) There exists an $i \in 1 \ldots n$ such that every propositional variable $p$ used in property $\varphi$ is a predicate
over $V_i \cup \Delta_{next_i}$.

Proof: The theorem follows immediately from the definitions above.



C. From Theory to Practice 

1) Over-approximation of the fault model: Based on the previous sections, we can construct a synchronous system model
that preserves properties. However, it may not be easy (and is sometimes impossible) to construct a fault
model for the synchronous model that is effect-indistinguishable from the original one. In such cases, an over-approximation
of the fault model (where time is abstracted to ticks) is necessary.

In FTOS, the occurrence of faults is bounded by the least time between failures (LTBF). Let the system period
be $T$ and the LTBF be $\eta$; then an extremely safe approximation for the number of fault occurrences during a period $T$ is
$\lceil T/\eta \rceil$. Since this approximation can be too strong, a more desirable way to under-approximate the LTBF is as
follows: when $\eta > 3T$, the time between the occurrences of two consecutive faults should contain at least $\lfloor \eta/T \rfloor - 1$ complete
periods⁴. When $\eta$ is large, the set of behaviors defined by the approximated fault model is close to that of the original
fault model. Since in practice the fault model is based on probability measures, the impact of such an over-approximation
of fault models is minor, meaning that a spurious counterexample is still a trace that the system cannot defend against.

With the above descriptions, an untimed fault model compatible with the verification model of GCA can be constructed⁵.
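
As an illustration of how time can be abstracted to ticks, the following Python sketch computes the two approximations discussed above and checks a tick-level fault sequence against the second one. It assumes that the bounds read ceil(T/LTBF) faults per period and floor(LTBF/T) - 1 complete fault-free periods between consecutive faults; these readings, the function names, and the list-of-period-indices representation are assumptions made for this sketch, not the FTOS-Verify implementation.

```python
import math


def max_faults_per_period(period, ltbf):
    """Very safe bound (assumed reading): faults possible within one period."""
    return math.ceil(period / ltbf)


def min_fault_free_periods(period, ltbf):
    """Assumed reading of the finer bound: complete periods that must
    separate two consecutive faults (intended for ltbf > 3 * period)."""
    return math.floor(ltbf / period) - 1


def respects_ltbf(fault_ticks, period, ltbf):
    """fault_ticks: sorted indices of the periods in which a fault occurs.
    True if every pair of consecutive faults is separated by at least
    the required number of complete fault-free periods."""
    gap = min_fault_free_periods(period, ltbf)
    return all(b - a - 1 >= gap for a, b in zip(fault_ticks, fault_ticks[1:]))


print(max_faults_per_period(10, 35))
print(min_fault_free_periods(10, 35))
print(respects_ltbf([0, 3, 7], 10, 35))
print(respects_ltbf([0, 2], 10, 35))
```

With T = 10 and an LTBF of 35, a fault at time 0 can be followed by one at time 35, which lies in period 3, so the tick sequence [0, 3] must be admitted by any sound over-approximation, while [0, 2] can be excluded.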

2) Scheduling policies in actual deployment: In an actual deployment (implementation), the set of traces may be constrained compared to the original GCA model due to the scheduling policy of the local machine. Note that with the fault effect masquerade, the synchronous model can exhibit more behaviors than a deployment, because the ordering of correct and masqueraded messages in the synchronous model is not uniquely determined, whilst an actual deployment might have the order fixed due to its restricted behavior.

3) Message delay: In our formal model construction, the message delay is not considered for simplicity reasons. 
Nevertheless, it is possible to extend the fault model with a timed queue for the storage of delayed messages. In 
practice, the error of message delay can be detected by our tool with an over-approximated fault model. 

VI. Implementation: FTOS-Verify 

We have implemented a first version of FTOS-Verify⁶, which is deployed as an Eclipse add-on for FTOS. In this section, we summarize the current features.

1) It enables an automatic translation from FTOS models into verification models; choices of analysis techniques can 
be made between fault propagation and interval analysis. 

2) It automatically generates some formal properties regarding fault-tolerance. 

3) It integrates existing model checkers into the Eclipse framework; the verification model can be checked from within the Eclipse IDE, and verification results are reorganized and shown in the Eclipse console.

4) Finally, since a counter-example (in the trace file) may reach 300000 lines, it performs automatic analysis to prune all unnecessary details so that the counter-example is easy for non-verification experts to understand (around 300 to 700 lines).

A. Template-based Model Generation 

In the previous sections we outlined all actions but left the implementation details of each action (mechanism) unspecified. In practice, we extract all mechanisms from existing platform-dependent templates and rewrite them into formats accepted by the model checker. Since on some implementation platforms an action corresponds to a sequence of code which in effect performs the update, we want to maintain the "shape" of sequential execution while the model is executed synchronously; by doing so, these verification templates describing fault-tolerance mechanisms can be easily adapted to generate new templates for other platforms. To achieve this goal, we use Cadence SMV [14] as our back-end verification engine, because it supports an extended description format which resembles sequential execution.

In FTOS, we apply template-based code generation to describe each action based on the meta-model layer. In this way, for a user-defined model with different names (e.g., of ports), instructions and formal specifications are annotated with the correct names when the model is generated.
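
The idea of annotating a template with model-specific names can be sketched as follows. This is a minimal illustration using Python's standard `string.Template`, not the FTOS template engine; the template text mirrors the shape of the property shown in Fig. 7, while the placeholder names (`ecu`, `output`, `trigger`, `index`) and the function `render_spec` are hypothetical.

```python
from string import Template

# Hypothetical verification template; placeholders are filled from the user model.
SPEC_TEMPLATE = Template(
    "Correct_${output}_Result :\n"
    "SPEC AG ((${trigger}_operating_value=${index})\n"
    "  -> (${ecu}_local_ports_${output}.Result = 0));"
)


def render_spec(ecu, output, trigger, index):
    """Instantiate the property template with names taken from the model."""
    return SPEC_TEMPLATE.substitute(ecu=ecu, output=output,
                                    trigger=trigger, index=index)


spec = render_spec("ecu1", "DigOutput1", "REDUNDANCY_TRIGGER_TMR", 0)
print(spec)
```

Renaming a port or an ECU in the model then only changes the arguments passed to the template, not the template itself.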

B. Analysis Techniques 1: Boolean Program for Fault Propagation 

Currently, verification of software systems relies heavily on abstraction techniques. For example, in the SLAM [2] project the program is converted into boolean programs based on predicate abstraction techniques. In our case, a simple abstraction technique is to apply fault abstraction, where we define the value of a port to be either Correct or Erroneous. These two values reflect the objective status of the value in the port. Combined with the platform-independent fault-tolerance mechanism, this method can provide a good indication of the correctness of fault-tolerance mechanisms. In FTOS-Verify, it is applicable to two built-in tests, namely TestPortAbsolute and TestLiveness.

⁴ The untimed synchronous model can still specify time, but only with finite granularity: the minimum time unit is decided by T specified in the GCA model.

⁵ Still, the LTBF can be very large for modeling. For this problem we have established some theoretical criteria to reduce the LTBF efficiently with property preservation 0.

⁶ The software can be retrieved on demand.

Figure 5. The demonstrator of the balanced-rod system.

Figure 6. The software model of an improperly designed controller in FTOS (with irrelevant details omitted).

At the same time, a corresponding abstraction should be established for the specification. Here the specification is written manually once, together with the code template for the fault-tolerance mechanism, and it can be reused afterwards.
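
A minimal Python sketch of the fault abstraction is given below. It assumes that a task output is Erroneous whenever the task function itself is faulty or at least one of the ports it reads is Erroneous; the encoding of Correct as 0 follows the property in Fig. 7, while the class and function names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    CORRECT = 0     # encoding assumed from Fig. 7, where "Result = 0" means not faulty
    ERRONEOUS = 1


def task_output(inputs, task_faulty=False):
    """Abstract a stateless task function: the written value is Erroneous
    if the function is faulty or any port it reads is Erroneous."""
    if task_faulty or any(s is Status.ERRONEOUS for s in inputs):
        return Status.ERRONEOUS
    return Status.CORRECT


print(task_output([Status.CORRECT, Status.CORRECT]))
print(task_output([Status.CORRECT, Status.ERRONEOUS]))
print(task_output([Status.CORRECT], task_faulty=True))
```

Under this abstraction the concrete port values disappear, so a model checker only has to track how the Erroneous status propagates through ports and tasks.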

C. Analysis Techniques 2: Preliminary Bounded Interval Analysis 

In FTOS-Verify, we also experiment with bounded interval analysis for examining the validity of the test TestPortRelative, where the user may specify the value domain for each port and input, so that a richer set of specifications can be verified.

VII. Case Study: Balanced-Rod System 

By applying formal verification, we can test the applicability of new or built-in mechanisms under different fault models before deployment. Here we illustrate a case study of a balanced-rod system (for the hardware, see Fig. 5). The purpose of the system is to use a PID controller with some data acquisition boards to control the rod such that it keeps balancing while pointing upwards. We expect the system to have fault-tolerance abilities, and our design intention is to use triple modular redundancy for parallel execution, applying a suitable voting mechanism to preclude malfunctioning units. We illustrate the verification process with an unsuitable design in FTOS.

A. Fault Model 

In this system, four fault configurations are possible, namely All_correct, 1_2_correct, 2_3_correct, and 1_3_correct, implying that the fault hypothesis allows at most one ECU to be faulty at any instant. The fault is limited to an erroneous reading from inputs or an erroneous result of task functions.

B. Software Model 

Fig. 6 is the software model of the balanced-rod controller. The job ControlJob contains one operation mode Main, whose execution strategy defines three parallel sequences (SeqOnNode1, SeqOnNode2, SeqOnNode3). SeqOnNode1 consists of one input read operation, one task, and several output operations. The input AnaInput1 reads the data from the cache of the dedicated hardware and stores it in the port MeasureAnd. The task function (e.g., PIDCtrl1) retrieves values from some ports, performs stateless functions, and writes to some ports. Each output operation (e.g., DigOutput1) retrieves the data from one particular port and actuates. The port MeasureAnd has the replication property equal to TRUE, meaning that in the deployment the port is replicated to separate machines, but there is no mechanism to guarantee consistency between these replicated ports.

Correct_DigOutput1_Result :

SPEC AG ((REDUNDANCY_TRIGGER_TMR_operating_value=0)
    -> (ecu1_local_ports_DigOutput1.Result = 0));

Figure 7. Specification automatically annotated by FTOS-Verify.

Figure 8. The deployment model of a wrongly designed controller (with irrelevant details omitted). Legend: M: MeasureAnd, E: ErrorSum, R: Result.

C. Fault Tolerance Model 

Since only one machine should offer the output value, a state machine REDUNDANCY_TRIGGER_TMR is constructed on the model level for decision making. For example, the state machine states that if the detected fault configuration is All_Correct (no machine is faulty) or 1_2_Correct, then machine 1 is responsible for generating the output. The state machine is translated and deployed into each sub-system.

In this example, our fault-tolerance design first requires the system to perform the test TestPortAbsolute (pre-built in FTOS) on port Result; for details, see the appendix. When the test fails, it simply performs the Ignore operation and expects that the state machine mechanism will work correctly, ruling out the possibility of faulty machines generating the output.

D. Properties 

One local safety property automatically generated by FTOS-Verify (e.g., annotated to machine 1) is shown in Fig. 7. Intuitively, this property specifies the following: if machine 1 (ecu1) is responsible for offering the output (REDUNDANCY_TRIGGER_TMR_operating_value = 0), then the value offered cannot be faulty (ecu1_local_ports_Out1.Result = 0).

E. Result 

In the fault-propagation model, model checking either reports the system correct or returns a counter-example showing a trace of fault emergence and propagation which leads to errors. In this example, the error trace describes the situation in which faults happen alternately between two fault configuration units. The scenario (reported by the model checker; here simplified for explanation) is as follows:

• In early stages of the first period, input I1 generates faulty results and stores them into ecu1_global_ports.MeasureAnd. As PIDCtrl1 uses data from ecu1_global_ports.MeasureAnd, the result generated by PIDCtrl1 is faulty, implying that ecu1_global_ports.ErrorSum and ecu1_global_ports.Result are faulty.

• In later stages of the first period, the values of ecu1_global_ports.Result, ecu2_global_ports.Result, and ecu3_global_ports.Result are sent to the other machines and examined using the voting mechanism. For ecu1, values from ecu2 and ecu3 are stored in ecu1_port_from2.Result and ecu1_port_from3.Result, respectively. All results of TestPortAbsolute on each machine indicate that ecu1 is faulty, and the REDUNDANCY_TRIGGER_TMR on each machine indicates that the output should be generated by ecu2. Also, the error that happened in machine 1 is ignored.

• In early stages of the second period, input I2 generates faulty results and stores them into ecu2_global_ports.MeasureAnd. Therefore, the result generated by PIDCtrl2 is faulty, implying that ecu2_global_ports.Result and ecu2_global_ports.ErrorSum are faulty. Although I1 is not faulty, PIDCtrl1 still cannot generate a correct result because it also relies on the value in ecu1_global_ports.ErrorSum. Therefore, ecu1_global_ports.Result is still faulty in this period.

• In later stages of the second period, the values of ecu1_global_ports.Result, ecu2_global_ports.Result, and ecu3_global_ports.Result are again sent and examined. As the values of ecu1_global_ports.Result and ecu2_global_ports.Result are faulty⁷, machines 1 and 2 will treat machine 3 as faulty, whilst machine 3 treats machines 1 and 2 as faulty. Later, when exchanging information regarding the correct status of machines, the opinion of machine 3 is precluded, and REDUNDANCY_TRIGGER_TMR decides that machine 1 is responsible for offering the output, which is undesired (although the system follows desired fault configurations).

• A fix of the example is to add port unification strategies over port ErrorSum with communication after TestPortAbsolute(). For example, in ecu1, take the median value among {ecu1_global_ports.ErrorSum, ecu1_port_from2.ErrorSum, ecu1_port_from3.ErrorSum} and store it as an update for ecu1_global_ports.ErrorSum.
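
The scenario above can be replayed with a small Python simulation in the fault-propagation abstraction. This is a sketch under several assumptions that are not taken from FTOS-Verify: port values are abstracted to a boolean (True means Erroneous), erroneous values held by two machines agree with each other (the worst case), the machine in the minority of the vote is judged faulty, the lowest-numbered machine not judged faulty offers the output, and the median-based unification is abstracted to a majority over the three ErrorSum statuses. All function names are hypothetical.

```python
def majority_erroneous(flags):
    """True if more than half of the boolean statuses are Erroneous."""
    return sum(flags) * 2 > len(flags)


def run(input_faults, unify_error_sum):
    """input_faults: for each period, the index (0..2) of the ECU whose input
    is faulty, or None. Returns, per period, the pair
    (number of the operating ECU, whether its Result is Erroneous)."""
    error_sum = [False, False, False]          # persistent state of each PID task
    log = []
    for faulty_ecu in input_faults:
        measure = [i == faulty_ecu for i in range(3)]
        # Result depends on the fresh measurement and on the stored ErrorSum.
        result = [measure[i] or error_sum[i] for i in range(3)]
        error_sum = list(result)               # ErrorSum is updated from the same data
        # Voting (assumed): the minority value is judged faulty.
        majority = majority_erroneous(result)
        judged_faulty = [r != majority for r in result]
        operating = judged_faulty.index(False)  # lowest-numbered trusted machine
        log.append((operating + 1, result[operating]))
        if unify_error_sum:                    # the proposed fix, abstracted
            error_sum = [majority_erroneous(error_sum)] * 3
    return log


print(run([0, 1], unify_error_sum=False))
print(run([0, 1], unify_error_sum=True))
```

Without unification, the second period ends with machine 1 offering an Erroneous Result, which violates the property of Fig. 7; with unification, the stale ErrorSum of machine 1 is repaired after the first period and machine 1 offers a Correct Result.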

F. More testcases and counter examples 

The documentation of the software also describes cases with much more complicated counter-examples that demonstrate the invalidity of some fault-tolerance mechanisms under certain fault models.



G. Efficiency

The time required to generate all models is approximately 4 seconds⁸, and the size of the generated model varies between 7000 and 11000 lines. Verification and generation of the counter-example take nearly 60 seconds for the simplest case; the time increases when complicated mechanisms are equipped in the model. For similar cases where the property is proven to hold, the required time is no greater than 250 seconds. An earlier (preliminary) version of FTOS-Verify based on SPIN without using the deterministic assumption was not able to generate results in reasonable time (< 60 min) even for the simplest testcase. These numbers show a drastic reduction of the required time and demonstrate the effect of the deterministic assumption.

VIII. Related Work

Existing work on the design or verification of fault-tolerance mechanisms falls into two categories. In the first category, researchers focus on verifying the applicability of a single fault-tolerance mechanism based on a concrete fault model. In the second category, researchers offer languages or methodologies for the use of verification, e.g., [3], [15]. Nevertheless, the above work does not focus on automatic model generation for verification. As the system models and the fault models influence the applicability of a mechanism, the corresponding verification models must be reconstructed whenever they are modified. Manual construction and modification can be time-consuming and error-prone; in FTOS-Verify, the verification model is regenerated automatically as the model changes.

FTOS is not the only system which focuses on model-based development for fault-tolerant systems. Pinello et al. [17] also focus on model-level description of software models, hardware models, and fault models, and allow automatic synthesis from models to deployments; their model (FTDF) is based on an extension of the dataflow model. Formal analysis techniques based on fault propagation were later proposed [13]. There are inherent differences between the two MoCs which make it difficult to compare the results. Our observation is that the tree-construction algorithm (similar to backward reachability analysis) proposed for FTDF does not apply symbolic techniques commonly used in verification and might suffer from exponential complexity⁹; we instead rely on a verification engine with symbolic state-space manipulation, making the detection of counter-examples in large systems possible. Secondly, our analysis is not restricted to reachability analysis but enables the use of temporal logic. Lastly, our theorem enables state-space reduction. A conjecture is that our theorem can be further extended to give a supporting theorem for their MoC¹⁰.

There are other model-based tools for embedded control with integrated verification abilities, e.g., [18], but verifying fault-tolerance mechanisms is not their primary focus. We also find that in most cases the deployment from the model is synchronous, whereas FTOS targets deployment over either synchronous or asynchronous systems.

⁷ This can be further interpreted as the case where ecu1 and ecu2 hold the same erroneous value.

⁸ Programs are executed on a machine with an Intel Dual Core 2.6 GHz CPU and 2 GB memory.

⁹ The complexity is $O(D^{M})$, where D is the average number of inputs in an actor and M is the number of actors, given an undesired event.

¹⁰ I.e., there exists a padding of no-ops for an FTDF model such that local properties can be verified synchronously.

IX. Conclusion and Future Work

Our contribution is summarized as follows.

• We concentrate on the modeling and verification of fault-tolerant systems, and mathematically construct the GCA system, the formal model of FTOS, for the use of verification.

• The deterministic assumption relates all implementations with common features regardless of the underlying MoC of the platform (synchronous or asynchronous). It also simplifies the model construction and reduces the reachable state space for verification. Due to the redundancies in fault-tolerant systems, constrained local properties used in verification can resemble global specifications. The work can also be applied to the analysis of systems based on the FLET paradigm proposed within Giotto.

• Regarding implementation, combined with existing templates for the synchronous platform and the asynchronous platform, the presented verification technique enables designers with limited knowledge of verification to test the applicability of fault-tolerance mechanisms formally.

Apart from work on industrial case studies, one extension is to relieve the user of the burden of providing the task-implication graph by generating it automatically. A task-implication graph specifies, for each output port, whether an incorrect function or an erroneous input will influence the generated value. If there exists user code (in general, task or control functions), then currently the translation into languages accepted by SMV is done by designers. For a task function with the set of input ports $P_{in}$, the powerset of input ports and the set inclusion relation $\subseteq$ form a lattice $(2^{P_{in}}, \subseteq)$.
Therefore, by using techniques similar to monotone data-flow analysis [11], a task-implication graph can be constructed automatically.
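
One way such a construction might look is sketched below in Python. It assumes that the user code of a task has already been reduced to a list of assignments of the form (target, sources); this representation, the function name, and the variable names in the example are hypothetical. Each variable is mapped to an element of the powerset of input ports, and the sets only grow during the iteration, so the fixpoint computation terminates.

```python
def influencing_inputs(assignments, input_ports):
    """Compute, for every assigned variable, the set of input ports that
    can influence it. `assignments` is a list of (target, [sources])."""
    deps = {p: {p} for p in input_ports}       # an input port depends on itself
    changed = True
    while changed:                             # iterate until a fixpoint is reached
        changed = False
        for target, sources in assignments:
            new = set().union(*(deps.get(s, set()) for s in sources))
            merged = deps.get(target, set()) | new
            if merged != deps.get(target, set()):
                deps[target] = merged
                changed = True
    return deps


# Hypothetical PID-like task body: Result depends on both input ports.
body = [
    ("err", ["MeasureAnd", "ErrorSum"]),
    ("gain", []),
    ("Result", ["err", "gain"]),
]
deps = influencing_inputs(body, {"MeasureAnd", "ErrorSum"})
print(sorted(deps["Result"]))
```

An edge of the task-implication graph from an input port to an output port would then be present exactly when the input port appears in the computed set of that output port.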

Finally, we will extend our modeling technique for the analysis and verification of various models, e.g., SDF Q (with 
hierarchical extensions) and FTDF models with a predefined scheduling. 

Appendix 

(Algorithmic Sketch of TestPortAbsolute) The mechanism of TestPortAbsolute consists of two rounds. Without loss of generality, let i, j and k be the deployed sub-systems, and let the algorithm describe the behavior on machine i.

In the first round, it compares the values received from other machines with the local variable. For machine i receiving from sender j, if the value is different, then it evaluates j to be faulty. The result of the evaluation is stored in an array $(i_i, j_i, k_i)$ and sent to the other machines.

In the second round, as machine i receives the arrays from the others, it performs a voting between the elements in the sets $\{i_i, i_j, i_k\}$, $\{j_i, j_j, j_k\}$, and $\{k_i, k_j, k_k\}$. The result of the voting finalizes the decision whether a machine is faulty or not.